Descriptive Statistics

Md Zulquar Nain

Importing the Data File

# importing data from `csv` file
datai <- read.csv("hsbraw.csv")
  • datai - name of the data frame that holds the imported data in R

  • hsbraw.csv - name of the csv file being imported
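
The import step can be tried without the original file. This is a minimal sketch that assumes only that the input is a comma-separated file with a header row. The temporary file, its columns and values, and the object name toy are hypothetical stand-ins, not the contents of hsbraw.csv.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical data, not hsbraw.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,gender,read",
             "1,male,63",
             "2,female,44",
             "3,male,47"), tmp)

# read.csv() expects a header row and comma separators by default
toy <- read.csv(tmp)

# Check that the import produced a data frame of the expected shape
class(toy)
dim(toy)
names(toy)
```

Checking class(), dim() and names() right after the import is a quick way to confirm that the separator and header were read as intended.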

Exploring the Dataset I

  • Class, structure and dimension of the dataset
# Structure of the data
str(datai)
'data.frame':   189 obs. of  9 variables:
 $ id     : int  3 4 5 6 7 8 9 10 11 12 ...
 $ gender : chr  "male" "female" "male" "female" ...
 $ schtyp : chr  "public" "public" "public" "public" ...
 $ prog   : chr  "academic" "academic" "academic" "academic" ...
 $ read   : int  63 44 47 47 57 39 48 47 34 37 ...
 $ write  : int  65 50 40 41 54 44 49 54 46 44 ...
 $ math   : int  48 41 43 46 59 52 52 49 45 45 ...
 $ science: int  63 39 45 40 47 44 -99 53 39 39 ...
 $ socst  : int  56 51 31 41 51 48 -99 61 36 46 ...
# Class of the data
class(datai)
[1] "data.frame"
# Dimension of the data
dim(datai)
[1] 189   9

Exploring the Dataset II

  • First n rows of observations of the data set
    • head(data.frame name, n)
  • Last n rows of observations of the data set
    • tail(data.frame name, n)
# View top two rows of the data
head(datai,2)
  id gender schtyp     prog read write math science socst
1  3   male public academic   63    65   48      63    56
2  4 female public academic   44    50   41      39    51
# View bottom two rows 
tail(datai,2)
     id gender  schtyp     prog read write math science socst
188 199   male private academic   52    59   50      61    61
189 200   male private academic   68    54   75      66    66

Descriptive Statistics

Measures of Central Tendency and Dispersion

  • For continuous variables, the usual measures are
    • central tendency: mean and median
    • dispersion: variance and standard deviation
  • Functions in R
    • mean()
    • median()
    • var()
    • sd() for standard deviation
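
As a small illustration of these four functions, the sketch below applies them to the first ten read scores printed in the str() output. The vector name read10 is introduced here only for the example.

```r
# First ten read scores as shown in the str() output
read10 <- c(63, 44, 47, 47, 57, 39, 48, 47, 34, 37)

mean(read10)    # arithmetic mean: 46.3
median(read10)  # middle value of the sorted scores: 47
var(read10)     # sample variance (divides by n - 1)
sd(read10)      # standard deviation, the square root of var()
```

On the full data set the same calls would take a column, for example mean(datai$read).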

Descriptive statistics

  • summary() function available with base R
  • mean, median, 1st and 3rd quartiles (25th and 75th percentiles)
  • min and max
summary(datai)
       id           gender             schtyp              prog          
 Min.   :  3.0   Length:189         Length:189         Length:189        
 1st Qu.: 52.0   Class :character   Class :character   Class :character  
 Median :101.0   Mode  :character   Mode  :character   Mode  :character  
 Mean   :101.7                                                           
 3rd Qu.:152.0                                                           
 Max.   :200.0                                                           
      read           write            math          science     
 Min.   :28.00   Min.   :31.00   Min.   :35.00   Min.   :-99.0  
 1st Qu.:47.00   1st Qu.:46.00   1st Qu.:46.00   1st Qu.: 44.0  
 Median :52.00   Median :54.00   Median :53.00   Median : 53.0  
 Mean   :52.99   Mean   :53.67   Mean   :53.35   Mean   : 47.7  
 3rd Qu.:60.00   3rd Qu.:61.00   3rd Qu.:60.00   3rd Qu.: 58.0  
 Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   : 74.0  
     socst      
 Min.   :-99.0  
 1st Qu.: 46.0  
 Median : 52.0  
 Mean   : 48.1  
 3rd Qu.: 61.0  
 Max.   : 71.0  
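
The minimum of -99 for science and socst stands out against the other score ranges. One plausible reading, which is an assumption here and is not confirmed by the data file, is that -99 is a code for a missing value. Under that assumption, a minimal sketch of recoding it before summarising, using the first ten science scores from the str() output (the name science10 is introduced only for this example):

```r
# First ten science scores as shown in the str() output (includes one -99)
science10 <- c(63, 39, 45, 40, 47, 44, -99, 53, 39, 39)

# ASSUMPTION: -99 is a missing-value code, so recode it to NA
science10[science10 == -99] <- NA

sum(is.na(science10))          # number of values now treated as missing
mean(science10, na.rm = TRUE)  # mean of the remaining scores
```

If the assumption holds, the means reported by summary() for science and socst are pulled down by the -99 entries and should be recomputed after recoding.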

Descriptive statistics

  • Selecting a specific column
summary(datai$read)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28.00   47.00   52.00   52.99   60.00   76.00 
  • More than one column
# create a subset of the data
subdata <- datai[,c("read","write","math")]
# summary of the subset
summary(subdata)
      read           write            math      
 Min.   :28.00   Min.   :31.00   Min.   :35.00  
 1st Qu.:47.00   1st Qu.:46.00   1st Qu.:46.00  
 Median :52.00   Median :54.00   Median :53.00  
 Mean   :52.99   Mean   :53.67   Mean   :53.35  
 3rd Qu.:60.00   3rd Qu.:61.00   3rd Qu.:60.00  
 Max.   :76.00   Max.   :67.00   Max.   :75.00  

Descriptive Statistics

  • Using the describe() function from the psych package

  • Gives more control over which statistics are reported

library(psych)

describe(subdata)
      vars   n  mean   sd median trimmed   mad min max range  skew kurtosis
read     1 189 52.99 9.94     52   52.78 11.86  28  76    48  0.20    -0.65
write    2 189 53.67 8.90     54   54.22 10.38  31  67    36 -0.52    -0.64
math     3 189 53.35 9.10     53   52.99 10.38  35  75    40  0.27    -0.67
        se
read  0.72
write 0.65
math  0.66

Descriptive Statistics

  • Without Skewness and Kurtosis
# without skewness and kurtosis
describe(subdata, skew=FALSE) 
      vars   n  mean   sd median min max range   se
read     1 189 52.99 9.94     52  28  76    48 0.72
write    2 189 53.67 8.90     54  31  67    36 0.65
math     3 189 53.35 9.10     53  35  75    40 0.66
  • Without the range-related statistics (median, trimmed, mad, min, max, range)
# without range-related statistics
describe(subdata, ranges = FALSE)
      vars   n  mean   sd  skew kurtosis   se
read     1 189 52.99 9.94  0.20    -0.65 0.72
write    2 189 53.67 8.90 -0.52    -0.64 0.65
math     3 189 53.35 9.10  0.27    -0.67 0.66

Descriptive Statistics

  • Summary Statistics by grouping data using some specific criteria
# generating summary statistics by grouping variable
describeBy(subdata, datai$schtyp)

 Descriptive statistics by group 
group: private
      vars  n  mean   sd median trimmed  mad min max range  skew kurtosis   se
read     1 32 54.25 9.20   52.0   53.85 7.41  36  73    37  0.32    -0.91 1.63
write    2 32 55.53 7.18   57.0   56.12 6.67  38  67    29 -0.70    -0.26 1.27
math     3 32 54.75 8.88   53.5   54.27 8.90  41  75    34  0.45    -0.69 1.57
------------------------------------------------------------ 
group: public
      vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
read     1 157 52.73 10.09     52   52.53 11.86  28  76    48  0.19    -0.66
write    2 157 53.29  9.18     54   53.81 11.86  31  67    36 -0.46    -0.77
math     3 157 53.07  9.15     53   52.72 10.38  35  75    40  0.25    -0.73
        se
read  0.81
write 0.73
math  0.73
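
Grouped summaries can also be sketched in base R, without the psych package. The example below is an illustration on a hypothetical toy data frame (the object toy and its values are made up for the example); tapply() and aggregate() are swapped in here as base R alternatives and are not the method used on the slide.

```r
# Hypothetical toy data: a few read scores for two school types
toy <- data.frame(
  schtyp = c("public", "public", "public", "private", "private"),
  read   = c(63, 44, 47, 52, 68)
)

# Mean of read within each school type, returned as a named vector
group_means <- tapply(toy$read, toy$schtyp, mean)
group_means

# The same idea with a formula interface, returned as a data frame
aggregate(read ~ schtyp, data = toy, FUN = mean)
```

Replacing mean with sd, median or another function gives the corresponding grouped statistic.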

Frequencies and Cross Tabulation

Frequency Table

  • Generating Frequency Tables
  • frequency tables using the table( ) function
table(datai$gender)

female   male 
   104     85 
table(datai$schtyp)

private  public 
     32     157 

Frequency Tables

  • tables of proportions using the prop.table( ) function
  • for proportions, use output of table() as input to prop.table()
# saving the frequency table to an object
tableg <- table(datai$gender)
prop.table(tableg)

   female      male 
0.5502646 0.4497354 
# OR
prop.table(table(datai$gender))

   female      male 
0.5502646 0.4497354 
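
Proportions are often easier to read as percentages. A minimal sketch, rebuilding the gender column from the counts shown earlier (104 female, 85 male); the names gender and pct are introduced only for this example:

```r
# Rebuild the gender column from the reported counts
gender <- rep(c("female", "male"), times = c(104, 85))

# Proportions converted to percentages, rounded to one decimal place
pct <- round(100 * prop.table(table(gender)), 1)
pct
```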

Cross Tabulation

  • Two Way Tabulation

  • counts in each crossing of gender and school type

tab2way <- table(datai$gender, datai$schtyp)
tab2way
        
         private public
  female      18     86
  male        14     71
  • Marginal frequencies using margin.table( )
margin.table(tab2way,margin = 1)

female   male 
   104     85 
margin.table(tab2way,margin = 2)

private  public 
     32     157 

Proportion Table

  • Row proportions
  • Proportion of each gender that falls into each school type
prop.table(tab2way, margin = 1)
        
           private    public
  female 0.1730769 0.8269231
  male   0.1647059 0.8352941
  • Column proportions
  • Proportion of each school type that falls into each gender
prop.table(tab2way, margin = 2)
        
           private    public
  female 0.5625000 0.5477707
  male   0.4375000 0.4522293
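
To see which direction each margin argument normalises in, the two-way table can be rebuilt from the counts shown above and checked directly. This is a sketch under the assumption that the table is entered by hand; the object name tab is introduced only for the example, and addmargins() is an additional base R function not used on the slides.

```r
# Rebuild the two-way table from the reported counts (filled column by column)
tab <- as.table(matrix(c(18, 14, 86, 71), nrow = 2,
                       dimnames = list(gender = c("female", "male"),
                                       schtyp = c("private", "public"))))

# Append row and column totals to the table of counts
addmargins(tab)

# margin = 1: proportions sum to 1 across each row
rowSums(prop.table(tab, margin = 1))

# margin = 2: proportions sum to 1 down each column
colSums(prop.table(tab, margin = 2))
```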

Correlation

Correlation

  • Can use the cor( ) function to produce correlations

  • General framework cor(x, use=, method= )

    • x: Matrix or data frame
    • use: Specifies the handling of missing data
    • method: Specifies the type of correlation
cordata <- datai[,c("read","write","math")]
cor(cordata,use="all.obs",method="pearson")
           read     write      math
read  1.0000000 0.5613371 0.6373328
write 0.5613371 1.0000000 0.5789356
math  0.6373328 0.5789356 1.0000000
cor(cordata,use="all.obs",method="spearman")
           read     write      math
read  1.0000000 0.5825882 0.6355307
write 0.5825882 1.0000000 0.6117342
math  0.6355307 0.6117342 1.0000000
cor(cordata,use="all.obs",method="kendall")
           read     write      math
read  1.0000000 0.4264602 0.4719542
write 0.4264602 1.0000000 0.4492563
math  0.4719542 0.4492563 1.0000000
  • all.obs: assumes no missing data; missing data will produce an error
  • complete.obs: listwise deletion
  • pairwise.complete.obs: pairwise deletion
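
The three settings of use only differ when missing values are present. The sketch below shows this on a hypothetical toy data frame with two NA entries; the object toy and its values are made up for the example.

```r
# Hypothetical toy scores with one missing value in each of two columns
toy <- data.frame(
  read  = c(63, 44, 47, NA, 57, 39),
  write = c(65, 50, 40, 41, NA, 44),
  math  = c(48, 41, 43, 46, 59, 52)
)

# use = "all.obs" stops with an error because NAs are present;
# tryCatch() turns that error into a plain string so the script keeps running
res <- tryCatch(cor(toy, use = "all.obs"), error = function(e) "error")
res

# Listwise deletion: only rows complete on every column are used
cor(toy, use = "complete.obs")

# Pairwise deletion: each correlation uses all rows complete for that pair
cor(toy, use = "pairwise.complete.obs")
```

With pairwise deletion, different cells of the matrix can be based on different numbers of observations, which is worth keeping in mind when comparing them.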

THANKS